Quantifying Tackling in Football

A Data-Driven Approach Using the NFL Big Data Bowl Dataset and Advanced Machine Learning Techniques

Dusty Turner

A Quick Reminder

Research Hypothesis


Research Question: Can we estimate each defensive player’s probability of making a tackle on every play?

Ultimately: Assign a ‘tackles over expected’ value to each player.

Literature Review

Previous NFL Big Data Bowl Competitions

  • 2020: How many yards will an NFL player gain after receiving a handoff?
  • 2021: Evaluate defensive performance on passing plays
  • 2022: Evaluate special teams performance
  • 2023: Evaluate linemen on pass plays

Data

Player & Game Identifiers

  • Game and Play IDs: Unique identifiers for games and individual plays
  • Player Information: Names, jersey numbers, team, position, physical attributes, college

In-Game Player Movements

  • Spatial Data: Player positions, movement direction, speed, and orientation
  • Time and Motion: Specific moments in play, distance covered

Detailed Play Information

  • Play Attributes: Description, quarter, down, yards needed
  • Team & Field Position: Possessing team, defensive team, yardline positions

Scoring and Game Probabilities

  • Scores & Results: Pre-snap scores, play outcomes
  • Probabilities: Win probabilities for home and visitor teams
  • Expected Points: Expected points and expected points added for each play

Tackles, Penalties, and Formations

  • Tackles & Fouls: Tackles, assists, fouls committed, and missed tackles
  • Ball Carrier Info: Identifiers and names of ball carriers
  • Team Formations: Offensive formations and number of defenders

Feature Development

Modeling Overview

Rows: 393,536
Technique: Group Splitting

Factors to Consider:
- Tackle (0/1)
- Future X/Y position
- S/A/O/Dir of the defender
- Position / alignment-cluster interaction
- Number of defenders in the box
- Current and future (0.5 seconds ahead) location of the ball
- O/S/A/Dir of the ball carrier
- Velocity/direction difference
- Ball in the defensive player’s ‘fan’
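Two of the factors above can be made concrete in Python: the velocity/direction difference and the ‘fan’ test. This sketch assumes the tracking-data convention of angles in degrees measured clockwise from the +y axis; the 45° half-angle for the fan is an illustrative assumption, not a value from the deck.

```python
import math

def velocity_direction_diff(def_s, def_dir, bc_s, bc_dir):
    """Speed and direction differences between a defender and the ball
    carrier. Directions are in degrees; the angular difference is
    wrapped into [0, 180]."""
    speed_diff = def_s - bc_s
    dir_diff = abs(def_dir - bc_dir) % 360
    if dir_diff > 180:
        dir_diff = 360 - dir_diff
    return speed_diff, dir_diff

def ball_in_fan(def_x, def_y, def_o, ball_x, ball_y, half_angle=45.0):
    """True if the ball falls inside the defender's forward 'fan':
    a wedge of +/- half_angle degrees around his orientation def_o
    (degrees clockwise from the +y axis -- an assumed convention)."""
    bearing = math.degrees(math.atan2(ball_x - def_x, ball_y - def_y)) % 360
    off = abs(bearing - def_o) % 360
    if off > 180:
        off = 360 - off
    return off <= half_angle
```

For example, a defender facing straight upfield (o = 0) has the ball in his fan when it sits directly ahead, but not when it sits 90° to his side.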

Concerns:
- Computational time
- Limited tuning parameters
- Limited data for train/test/validation

Modeling Overview

  • Penalized Regression: {GLMNET}
    • Train:
    • Test:
    • Validate:
    • Baseline Accuracy: 92.9%
  • Random Forest: {Ranger}
    • Train: 19426
    • Test: 369996
    • Validate: 4114
    • Baseline Accuracy: 92.9%
  • XGBoost: {XGBoost}
    • Train: 19426
    • Test: 369996
    • Validate: 4114
    • Baseline Accuracy: 92.9%
  • Neural Network: {Reticulate} (Python TensorFlow)
    • Train:
    • Test:
    • Validate:
    • Baseline Accuracy: 92.94%
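The “Group Splitting” technique above can be sketched in pure Python: every frame from a given play lands in the same partition, so the same play never leaks across train, test, and validation. The split fractions and field names (`gameId`, `playId`) here are illustrative assumptions, not the deck’s exact values.

```python
import random

def group_split(frames, train=0.05, validate=0.01, seed=42):
    """Partition tracking frames by play -- 'group splitting' keyed on
    (gameId, playId) -- so one play's frames never straddle splits.
    The remainder after train/validate becomes the test set."""
    plays = sorted({(f["gameId"], f["playId"]) for f in frames})
    rng = random.Random(seed)
    rng.shuffle(plays)
    n_train = int(len(plays) * train)
    n_val = int(len(plays) * validate)
    train_ids = set(plays[:n_train])
    val_ids = set(plays[n_train:n_train + n_val])
    split = {"train": [], "validate": [], "test": []}
    for f in frames:
        key = (f["gameId"], f["playId"])
        if key in train_ids:
            split["train"].append(f)
        elif key in val_ids:
            split["validate"].append(f)
        else:
            split["test"].append(f)
    return split
```

Splitting on play groups rather than raw rows is what makes a small train set with a much larger test set (as in the counts above) a deliberate choice rather than an accident.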

Penalized Regression

\[\text{Minimize } \left\{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda \left[ \frac{1 - \alpha}{2} \|\boldsymbol{\beta}\|_2^2 + \alpha \|\boldsymbol{\beta}\|_1 \right] \right\}\]

The best parameters are: Lambda = 0.01269 and Alpha = 0.00001, with an accuracy of 90.91%.

Random Forest

The best parameters are: Mtry = 7, Min_n = 6, and Trees = 278 with an accuracy of 92.87%.
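{Ranger}’s tuned parameters map roughly onto scikit-learn names as Mtry → `max_features`, Min_n → `min_samples_leaf`, and Trees → `n_estimators`. A hedged sketch of that mapping on synthetic stand-in data (illustrating the parameterization, not reproducing the original fit):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))              # stand-in features
y = (X[:, 0] - X[:, 1] > 0).astype(int)     # stand-in tackle labels

# mtry -> max_features, min_n -> min_samples_leaf, trees -> n_estimators
model = RandomForestClassifier(n_estimators=278, max_features=7,
                               min_samples_leaf=6, random_state=1)
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]        # per-row tackle probabilities
```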

XGBoost

The best parameters are: Trees = 219, Min_n = 9, Tree Depth = 1, Learn Rate = 1.2, Loss Reduction = 24, and Sample Size = 1 with an accuracy of 92.87%.

Neural Network

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.regularizers import l2

def build_model(input_shape):
    model = Sequential([
        Dense(64, activation='relu', input_shape=[input_shape], kernel_regularizer=l2(0.001)),
        BatchNormalization(),  # normalizes layer inputs to stabilize and accelerate training
        Dropout(0.3),          # randomly deactivates neurons to prevent overfitting
        Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Dropout(0.3),
        Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Dropout(0.3),
        Dense(1, activation='sigmoid', kernel_regularizer=l2(0.001))  # L2-regularized output layer
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model


Accuracy: 92.92%

Tackles Above or Below Expected

\(\sum_{i=1}^{N} (\mathbb{I}_{\text{tackle}_i} - P(\text{tackle}_i))\)

Where:

  1. \(N\) is the total number of plays
  2. \(P(\text{tackle}_i)\) is the probability of a tackle on play \(i\)
  3. \(\mathbb{I}_{\text{tackle}_i}\) is the indicator function which is 1 if a tackle occurred on play \(i\) and 0 otherwise
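A minimal pure-Python implementation of this sum, aggregated per player (the record field names are illustrative, and the probabilities here are made up):

```python
def tackles_over_expected(records):
    """Per-player sum of (actual tackle indicator - predicted tackle
    probability), following the formula above."""
    totals = {}
    for r in records:
        totals[r["player"]] = totals.get(r["player"], 0.0) + (r["tackle"] - r["prob"])
    return totals

plays = [
    {"player": "A", "tackle": 1, "prob": 0.30},
    {"player": "A", "tackle": 0, "prob": 0.10},
    {"player": "B", "tackle": 0, "prob": 0.60},
]
toe = tackles_over_expected(plays)
# Player A: (1 - 0.30) + (0 - 0.10) = 0.60; Player B: 0 - 0.60 = -0.60
```

A positive value means the player made more tackles than the model expected given his opportunities; a negative value means fewer.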

Tackles Above or Below Expected

Penalized Regression
Accuracy: 90.91%

  Display Name         Tackles Over Expected
  Talanoa Hufanga                       7.10
  Jonathan Owens                        5.76
  Marcus Epps                           4.66
  Grover Stewart                        3.89
  Nicholas Morrow                      −5.06
  Divine Deablo                        −5.55
  Christian Kirksey                    −5.61
  Myles Hartsfield                     −5.91
Random Forest
Accuracy: 92.27%

  Display Name         Tackles Over Expected
  Talanoa Hufanga                       4.92
  Maxx Crosby                           3.89
  Jonathan Owens                        3.87
  Cameron Jordan                        3.79
  Xavier McKinney                      −2.80
  Demario Davis                        −2.91
  Damien Wilson                        −3.15
  Cody Barton                          −3.76
XGBoost
Accuracy: 92.05%

  Display Name         Tackles Over Expected
  Talanoa Hufanga                       6.12
  Jonathan Owens                        4.58
  Maxx Crosby                           4.16
  Cameron Jordan                        4.03
  Damien Wilson                        −3.66
  Christian Kirksey                    −3.97
  Cody Barton                          −5.08
  Demario Davis                        −5.57
Neural Network
Accuracy: 92.92%

  Display Name         Tackles Over Expected
  Jonathan Owens                        4.79
  Jihad Ward                            3.71
  C.J. Mosley                           3.58
  Grover Stewart                        3.43
  Bradley Roby                         −1.85
  Roy Lopez                            −1.90
  Tyrann Mathieu                       −1.93
  Marcus Davenport                     −2.19